Objective
In this notebook you will learn how to build and analyse a network derived from a co-expression analysis. Other types of networks, including mixed networks (metabolite, PPI, gene, drug), can be built and analysed in the same way, differing only in the functional analysis.
Data
Download the data folder. You will start with the file data/gene_expression.tsv, which contains TPM values for 843 genes across 48 samples.
Software
This notebook relies on python's igraph for most of the analyses. Most, if not all, functions used here can also be applied with R's igraph. Other packages exist for network analysis including networkx and graph-tool. Snap.py is also a good alternative for large networks.
An introduction to network analysis in igraph in R can be found here.
Instructions
Further instructions are found here.
import pandas as pd
import numpy as np
import igraph as ig
import sklearn
import sklearn.neighbors
import leidenalg
import scipy as sp
from statsmodels.stats.multitest import multipletests
import itertools
import gseapy as gp
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
You will use a gene expression dataset (RNAseq, expressed as TPM) from a disease group with 48 samples. To keep analysis memory and time requirements reasonable for this lesson, we have selected 843 genes. We will assume that all the samples are viable for comparison and that no technical artifacts are present.
data=pd.read_csv('data/gene_expression.tsv', sep="\t")
data.head()
We will use the annotations from the Human Protein Atlas to map the above Ensembl ids to gene names. Ensembl ids with no name are dropped.
patlas=pd.read_csv('data/proteinatlas.tsv', sep="\t").loc[:,['Ensembl','Gene']]
data=pd.merge(data, patlas, on='Ensembl', how='inner')
No gene names are duplicated in this case; otherwise we would need to collapse duplicates into a unique set of genes (see the sketch after the check below).
any(data.Gene.duplicated())
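Had duplicates been present, a minimal sketch of one possible fix, assuming that averaging the TPMs of rows sharing a gene name is acceptable (keeping the most expressed entry is another common choice), could be:
#Hedged sketch: collapse duplicated gene names by averaging their TPMs (no-op for this dataset)
if any(data.Gene.duplicated()):
    sample_cols = [c for c in data.columns if c not in ['Ensembl', 'Gene']]
    data = data.groupby('Gene', as_index=False)[sample_cols].mean()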
data=data.set_index('Gene')
data=data.loc[:,data.columns!='Ensembl']
data.shape #755 genes, 48 samples
data.head()
We will start by filtering the data, removing genes whose expression is consistently low (median TPM < 1).
proc_data=data.copy()
proc_data=proc_data.loc[proc_data.median(1)>1]
proc_data.shape #499 genes
A quick look shows several gene clusters, including two major groups. Note, however, that the analysis below does not yet apply any statistical filtering.
gene_correlations=proc_data.T.corr()
sns.clustermap(gene_correlations);
We will save this initial filtered dataset for further analysis in the sections below.
proc_data.to_csv('data/filtered_gene_expression.tsv', sep="\t")
Our initial network analysis will be performed on a co-expression network built from Spearman correlations. Because this approach has a high chance of producing false positives, we will apply a Bonferroni correction to control the family-wise error rate, as well as an FDR correction.
#Columns are samples, rows are genes
data=pd.read_csv('data/filtered_gene_expression.tsv', sep="\t", index_col=0)
An important consideration when performing correlation analysis is how to deal with NAs.
The best strategy depends on the type of data that you have and can severely affect downstream results, so it is worth thinking about carefully (a hedged sketch of common options follows the check below).
data.isna().any().any() #we have no rows with NA
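As an illustration only, two common strategies are dropping incomplete genes or imputing a simple per-gene value; which (if either) is appropriate depends entirely on why the values are missing:
#Hedged sketch of common NA-handling strategies (not needed here since there are no NAs):
#1) drop genes (rows) with any missing sample
#data_complete = data.dropna(axis=0, how='any')
#2) or impute a simple value, e.g. the per-gene median (a simplistic assumption)
#data_imputed = data.apply(lambda row: row.fillna(row.median()), axis=1)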
We will perform a gene-gene correlation analysis by computing pairwise Spearman correlations. Choosing between non-parametric (Kendall, Spearman) and parametric (Pearson) methods depends on your data. Here, because we have a relatively small sample size and want to keep the computation cheap, we choose Spearman.
The following takes a few minutes to run on a normal laptop, so you may instead load the resulting correlations table.
#Correlation and P val matrices
Rmatrix, Pmatrix= sp.stats.spearmanr(data.T)
Rmatrix=pd.DataFrame(Rmatrix, index=data.index.copy(), columns=data.index.copy())
sns.clustermap(Rmatrix, cmap="RdBu_r", center=0); #resulting R matrix, similar to that seen in the section above
If we look at the matrix of P values, we can already see that many of the correlations in the top right columns are not significant even before multiple hypothesis correction:
Pmatrix=pd.DataFrame(Pmatrix, index=data.index.copy(), columns=data.index.copy())
sns.clustermap(Pmatrix); #resulting P value matrix
We will now adjust the P values based on the (big) number of comparisons done. For just 499 genes, we have a total of $499^2 = 249001$ correlations. However, this number counts each correlation twice (gene A vs gene B, and gene B vs gene A) and includes the correlation of each gene with itself.
The true number of comparisons is therefore $\binom{499}{2} = \frac{499!}{2!(499-2)!} = 124251$ correlations. At an error rate of 0.05, the probability of finding at least one false positive is essentially 100%: $1-0.95^{124251} \approx 1$. We thus need to correct the P values. We will apply a Bonferroni correction as well as an FDR correction.
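As a quick sanity check of these numbers (a throwaway computation, not part of the pipeline):
from scipy.special import comb
n_pairs = int(comb(499, 2)) #124251 unique gene pairs
print(n_pairs, 1 - 0.95**n_pairs) #probability of at least one false positive at alpha=0.05 is ~1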
###Gets the upper triangular matrix
Psquared=Pmatrix.where(np.triu(np.ones(Pmatrix.shape),1).astype(bool)) #np.bool is removed in recent numpy versions
Psquared.columns.name='Gene2'
#R matrix manipulation to prepare for merging with the P matrix
Rmatrix=Rmatrix.stack()
Rmatrix.index.names=['v1','v2'] #Avoid stacked names colliding
Rmatrix=Rmatrix.reset_index()
Rmatrix.columns=['gene1','gene2','R']
#P matrix manipulation to prepare for merging with the R matrix
Pmatrix=Pmatrix.stack()
Pmatrix.index.names=['v1','v2']
Pmatrix=Pmatrix.reset_index()
Pmatrix.columns=['gene1','gene2','P']
PRmatrix=pd.merge(Rmatrix, Pmatrix, on=['gene1','gene2']) #Correlation matrix with both R and P
#Multiple hypothesis correction computed on the P column
adjP=pd.DataFrame(multipletests(PRmatrix['P'], method='bonferroni', alpha=0.01)[1], columns=['Padj'])
FDR=pd.DataFrame(multipletests(PRmatrix['P'], method='fdr_bh', alpha=0.01)[1], columns=['FDR'])
PRmatrix=pd.concat([ PRmatrix, adjP], axis=1)
PRmatrix=pd.concat([ PRmatrix, FDR], axis=1)
# PRmatrix.to_csv('data/coexpression_matrix.tsv', sep="\t", index=False) #saves the matrix
PRmatrix.head()
Considering the Bonferroni correction we find 37663 correlations that are statistically significant at an $\alpha = 0.01$. If we instead use the FDR as correction method, we find 144953, which at an FDR of 0.01 implies roughly $0.01 \times 144953 \approx 1450$ expected false positives.
sum(PRmatrix.Padj<0.01)
sum(PRmatrix.FDR<0.01)
Let's add two additional columns, where we assign R=0 for those correlations that are not statistically significant (adjusted P or FDR > 0.01).
PRmatrix.loc[:,'R (padj)']=PRmatrix['R'].copy()
PRmatrix.loc[:,'R (fdr)']=PRmatrix['R'].copy()
PRmatrix.loc[PRmatrix['Padj']>0.01,'R (padj)']=0
PRmatrix.loc[PRmatrix['FDR']>0.01,'R (fdr)']=0
Now we can see how the initial heatmap of correlations looks after this statistical filtering.
#Transforming to a squared matrix again
PRQ=pd.concat([
PRmatrix.copy(),
PRmatrix.copy().rename(columns={'gene1':'gene2','gene2':'gene1'}).loc[:,PRmatrix.columns]
]).drop_duplicates()
Rmatrix_padj=PRQ.copy().pivot(index='gene1',columns='gene2',values='R (padj)')
Rmatrix_fdr=PRQ.copy().pivot(index='gene1',columns='gene2',values='R (fdr)')
Rmatrix_padj=Rmatrix_padj.loc[Rmatrix_padj.index,Rmatrix_padj.index]
Rmatrix_fdr=Rmatrix_fdr.loc[Rmatrix_padj.index,Rmatrix_padj.index]
g=sns.clustermap(Rmatrix_padj, cmap="RdBu_r", center=0);
g.fig.suptitle('R (filtered with Bonferroni)');
plt.show()
g=sns.clustermap(Rmatrix_fdr, cmap="RdBu_r", center=0);
g.fig.suptitle('R (filtered with FDR)');
plt.show()
The plots above show that the Bonferroni correction is only selecting very high (absolute) correlations. This may remove false positives, but it may also remove slightly weaker correlations that are biologically relevant. The Bonferroni correction also removes most of the negatively-coexpressed genes. Notice this from the distribution of correlation coefficients:
shortPR=PRmatrix.copy().loc[:,['gene1','gene2','R (padj)','R (fdr)']]
shortPR=shortPR.loc[shortPR.gene1!=shortPR.gene2]
fig=plt.figure(figsize=(8,4))
sns.kdeplot(shortPR['R (padj)'][shortPR['R (padj)']!=0], color='tab:blue'); #sns.distplot is deprecated in recent seaborn
sns.kdeplot(shortPR['R (fdr)'][shortPR['R (fdr)']!=0], color='tab:orange');
fig.legend(labels=['bonferroni (<0.01)','fdr (<0.01)']);
plt.xlabel('R')
plt.ylabel('Probability Density Function')
plt.title('Distribution of correlation coefficients')
plt.show()
We can also observe that the Bonferroni correction yields a more homogeneous number of co-expressed genes for each gene (i.e. first neighbors), compared to the FDR filtering. Notice how for a subset of genes many correlations are identified (>300/498).
gene_associations_padj=pd.concat([
shortPR.copy().loc[shortPR['R (padj)']!=0,][['gene1','gene2']],
shortPR.copy().loc[shortPR['R (padj)']!=0,][['gene2','gene1']].rename(columns={'gene1':'gene2','gene2':'gene1'})]).drop_duplicates().groupby('gene1').agg('count')
gene_associations_fdr=pd.concat([
shortPR.copy().loc[shortPR['R (fdr)']!=0,][['gene1','gene2']],
shortPR.copy().loc[shortPR['R (fdr)']!=0,][['gene2','gene1']].rename(columns={'gene1':'gene2','gene2':'gene1'})]).drop_duplicates().groupby('gene1').agg('count')
gene_associations=pd.concat([gene_associations_padj, gene_associations_fdr], axis=1, sort=True)
gene_associations.fillna(0,inplace=True)
gene_associations.columns=['bonferroni','fdr']
fig=plt.figure(figsize=(8,4))
sns.boxplot( data=gene_associations, notch=True);
plt.title('Number of associations per gene')
plt.ylabel('#')
plt.show()
Building graphs from the co-expression analysis, we will now create several networks to analyse further: a weighted network of the positive co-expressions (FDR < 0.01), an unweighted network of all significant co-expressions, a k-nearest-neighbour graph (kNN-G) built directly from the expression data, and a random (Erdős-Rényi) network with matching node and edge numbers to use as a reference.
# Prepares table for being read by igraph
PRmatrix=pd.read_csv('data/coexpression_matrix.tsv', sep="\t")
PRmatrix['R (fdr)']=PRmatrix['R'].copy() #recreate the FDR-filtered R column in case the saved matrix does not include it
PRmatrix.loc[PRmatrix['FDR']>0.01,'R (fdr)']=0
PRmatrix=PRmatrix.loc[PRmatrix['R (fdr)']!=0,['gene1','gene2','R (fdr)']]
PRmatrix=PRmatrix.loc[PRmatrix.gene1!=PRmatrix.gene2] #drops self correlations
fdr_pos_mat=PRmatrix.loc[PRmatrix['R (fdr)']>0]
fdr_neg_mat=PRmatrix.loc[PRmatrix['R (fdr)']<0]
We will now build the kNN-G, using gene-gene distances as input to determine the nearest neighbours. Because the data is expressed in comparable units (TPM), we can build the network directly from the distances between features. If features were not comparable because they came from different datasets (e.g. metabolomics and proteomics), they would first need to be normalized (a sketch follows the code below). We'll select the top 50 neighbours for each gene.
#Computes a kNN adjacency matrix from the input dataset
#and prepares table for being read by igraph
data=pd.read_csv('data/filtered_gene_expression.tsv', sep="\t", index_col=0)
knnG=sklearn.neighbors.kneighbors_graph(data, 50, metric='euclidean')
knnG=pd.DataFrame(knnG.toarray(), columns=data.index.copy(), index=data.index.copy()) #adjacency matrix
knnG.index.name='gene1'
knnG.columns.name='gene2'
knnG=knnG.stack().reset_index().rename(columns={0:'Connectivity'})
knnG=knnG.loc[knnG['Connectivity']!=0]
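If the features were not directly comparable, a minimal sketch, assuming per-gene z-scoring is an acceptable normalization, could be:
#Hedged sketch: z-score each gene (row) before computing neighbours so features on different scales become comparable
#scaled = data.sub(data.mean(axis=1), axis=0).div(data.std(axis=1), axis=0)
#knnG_scaled = sklearn.neighbors.kneighbors_graph(scaled, 50, metric='euclidean')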
We will now generate the graphs from the dataframes above
#positive coexpressions, weighted
pos_w=ig.Graph.TupleList([tuple(x) for x in fdr_pos_mat.values], directed=False, edge_attrs=['w'])
#full network, unweighted
all_u=ig.Graph.TupleList([tuple(x) for x in PRmatrix.values], directed=False)
#knnG, unweighted
knn=ig.Graph.TupleList([tuple(x) for x in knnG.values], directed=False)
#random network, unweighted, node and edge number based on overall network
random=ig.Graph.Erdos_Renyi(n=499, m=PRmatrix.shape[0], directed=False, loops=False)
For representation purposes we will see how the kNN graph looks. Be careful when drawing the others: since they have many more edges, plotting becomes computationally heavy.
#This plots each graph, using degree to present node size:
knn.vs['degree']=knn.degree()
knn.vs['degree_size']=[(x*20)/(max(knn.vs['degree'])) for x in knn.vs['degree']] #degree scaled so that the largest node has size 20 and all nodes remain visible
layout = knn.layout_mds()
ig.plot(knn, layout=layout, vertex_color='white', edge_color='silver', vertex_size=knn.vs['degree_size'])
We can see that while the same number of nodes is found in all networks, the number of edges varies greatly. We also see that each network is fully connected, which is not always the case. If a network were not connected, we could select the k largest connected components and proceed with the analyses on them. The largest connected component is called the giant component (a sketch of extracting it is shown below).
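For reference, a minimal sketch of how one might extract the giant component with igraph (not needed here, since these graphs are connected):
#Hedged sketch: keep only the largest connected component of a graph
components = knn.components() #VertexClustering with one cluster per connected component
giant = components.giant() #subgraph induced by the largest component
print(giant.vcount(), giant.ecount())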
#function to get graph properties
def graph_prop(input_graph):
    ncount=input_graph.vcount()
    ecount=input_graph.ecount()
    diameter=input_graph.diameter()
    av_path=input_graph.average_path_length()
    dens=input_graph.density()
    clustering=input_graph.transitivity_undirected() #this is the global clustering coefficient
    conn=input_graph.is_connected()
    min_cut=input_graph.mincut_value()
out=pd.DataFrame([ncount, ecount, diameter, av_path, dens, clustering, conn, min_cut],
index=['node_count','edge_count','diameter','av_path_length','density','clustering_coef','connected?','minimum_cut']).T
return(out)
#summarizing graph properties
network_stats=pd.DataFrame()
for nn in [pos_w, all_u, knn, random]:
network_stats=pd.concat([network_stats,graph_prop(nn)])
network_stats.index=['pos_w','all_u','knn','random']
network_stats
We'll look into different centrality measures:
Degree, Betweenness and Closeness are additionally computed for the positive coexpression network by taking into account each edge's weight. For instance, for degree this is done for each node by summing the weights of its edges (the node's strength).
Because the number of shortest paths in a network scales with the network size, we normalize Betweenness and Closeness with respect to the network size so that they can be compared between the four networks above.
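Concretely, the betweenness scaling applied in the code below is the standard normalization for undirected graphs: $b'(v) = \frac{2\,b(v)}{(n-1)(n-2)}$, where $n$ is the number of nodes.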
Degree distribution
Let's start by comparing the degrees of the random network against the three other networks. From the figures below it seems that there is no relationship between the degree of the random network and any of the three others.
fig, axes=plt.subplots(nrows=1, ncols=3, figsize=(15,5), sharey='row')
for i, ax in zip(range(4), axes.flat):
    sns.regplot(x=[pos_w,all_u,knn][i].degree(), y=random.degree(), ax=ax);
ax.set_title(['pos','all','knn'][i])
ax.set(xlabel='k ('+['pos','all','knn'][i]+')', ylabel='k (random)')
def transform_degree(graph):
alldegs=graph.degree()
    alldegs=pd.DataFrame([[key,len(list(group))] for key,group in itertools.groupby(sorted(alldegs))], columns=['k','count']) #sort first so groupby aggregates all nodes with the same degree
alldegs['P(k)']=[x/alldegs['count'].sum() for x in alldegs['count']]
alldegs=alldegs.loc[:,['k','P(k)']]
alldegs.drop_duplicates(inplace=True)
alldegs.reset_index(drop=True, inplace=True)
return(alldegs)
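The helper above is not used further in this notebook; as a hedged illustration, one way to plot the resulting degree distributions on log-log axes could be:
#Hedged sketch: plot P(k) for each network using transform_degree defined above
fig, ax = plt.subplots(figsize=(6, 6))
for g, name in zip([pos_w, all_u, knn, random], ['pos', 'all', 'knn', 'random']):
    dk = transform_degree(g)
    ax.scatter(dk['k'], dk['P(k)'], s=10, label=name)
ax.set(xscale='log', yscale='log', xlabel='k', ylabel='P(k)')
ax.legend()
plt.show()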
fig, ax = plt.subplots(figsize=(7, 7))
# ax.set(yscale='log', xscale='log')
p=sp.stats.probplot(pos_w.degree(), plot=ax)
a=sp.stats.probplot(all_u.degree(), plot=ax)
k=sp.stats.probplot(knn.degree(), plot=ax)
r=sp.stats.probplot(random.degree(), plot=ax)
col=['blue','','lightsalmon','','green','','red']
for x in np.arange(0,7,2):
ax.get_lines()[x].set_markerfacecolor(col[x])
ax.get_lines()[x].set_markeredgewidth(0)
ax.get_lines()[x+1].set_color(col[x])
fig.legend(labels=['pos','pos','all','all','knn','knn','random','random']);
ax.set(xlabel='Theoretical quantiles', ylabel='observed degree (k)')
plt.show()
Centrality
We will now compare different centrality metrics between the graphs.
#function to compute different centrality measures
def combine_raw_centralities():
def centrality_raw(input_graph, graph_name):
deg=input_graph.degree()
ecc=input_graph.eccentricity()
node_n=input_graph.vcount()
btw=[(2*x / ((node_n-1)*(node_n-2))) for x in input_graph.betweenness(directed=False)] #scaled to account for network size
cls=input_graph.closeness(normalized=True)
#if edge weight is defined, compute weighted centralities
#otherwise, return a vector of 0
if('w' in input_graph.es.attribute_names()):
deg_w=input_graph.strength(weights='w') #this is a weighted degree
btw_w=[(2*x / ((node_n-1)*(node_n-2))) for x in input_graph.betweenness(directed=False, weights='w')]
cls_w=input_graph.closeness(weights='w', normalized=True)
else:
btw_w=np.repeat(0,input_graph.vcount())
deg_w=np.repeat(0,input_graph.vcount())
cls_w=np.repeat(0,input_graph.vcount())
out=pd.DataFrame(
[deg, ecc, btw, cls, deg_w, btw_w, cls_w],
            index=['degree','eccentricity','betweenness', 'closeness',
                   'deg_w','btw_w','cls_w']).T
out['graph']=graph_name
out=out.loc[:,np.append(['graph'],out.columns[out.columns!='graph'])]
return(out)
#Computes centralities for all networks
network_centralities_raw=pd.DataFrame()
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(pos_w,'pos_w')])
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(all_u,'all_u')])
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(knn,'knn')])
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(random,'random')])
return(network_centralities_raw)
network_centralities_raw=combine_raw_centralities()
fig, axes=plt.subplots(nrows=4, ncols=2, figsize=(16,20), sharey='row')
cen=['degree','deg_w','betweenness', 'btw_w', 'closeness','cls_w','eccentricity']
for i, ax in zip(range(10), axes.flat):
if(i==7):
fig.delaxes(ax)
break
sns.boxplot(data=network_centralities_raw, x='graph', y=cen[i], #outliers are hidden from plotting
notch=True, ax=ax, showfliers=False)
ax.set_title(cen[i])
ax.set(xlabel='')
Overall, we see:
Degree: Full network > Positive coexp. network > kNN-G.
Betweenness: kNN-G > Full network ~ Positive coexp. network.
Closeness: Full network > Positive coexp. network > kNN-G.
Eccentricity: the distributions overlap for the Positive coexp. network and the Full network, but nodes in the latter tend to display lower eccentricity. In turn, most nodes in the kNN-G tend to display an eccentricity of 4-5.
Communities
To identify communities, we will use the Leiden algorithm. This method is similar to the Louvain method but fixes some of its intrinsic problems.
knn.vs['degree']=knn.degree()
knn.vs['degree_size']=[(x*20)/(max(knn.vs['degree'])) for x in knn.vs['degree']] #degree scaled so that the largest node has size 20 and all nodes remain visible
partition = leidenalg.find_partition(knn, leidenalg.ModularityVertexPartition)
layout = knn.layout_mds()
ig.plot(partition, layout=layout, edge_color='silver', vertex_size=knn.vs['degree_size'])
We will perform the community analysis on the 4 networks. We will perform one additional community analysis by taking into account the edge weights of the positively coexpressed network. Importantly, modularity-based partitioning searches for the largest possible communities in our network, which may not always be desired. Alternative models such as the Constant Potts Model allow you to identify smaller communities (a sketch is shown after the modularity scores below). If we know that our data contains specific feature classes, we can check whether the communities recover those classes by examining them individually, increasing the resolution if needed.
pos_comm = leidenalg.find_partition(pos_w, leidenalg.ModularityVertexPartition)
pos_w_comm = leidenalg.find_partition(pos_w, leidenalg.ModularityVertexPartition, weights='w')
all_comm = leidenalg.find_partition(all_u, leidenalg.ModularityVertexPartition)
knn_comm = leidenalg.find_partition(knn, leidenalg.ModularityVertexPartition)
random_comm = leidenalg.find_partition(random, leidenalg.ModularityVertexPartition)
Predictably, we can see that the modularity score of any of the networks is substantially larger than that of the random network.
np.round(pos_comm.modularity,3)
np.round(pos_w_comm.modularity,3)
np.round(all_comm.modularity,3)
np.round(knn_comm.modularity,3)
np.round(random_comm.modularity,3)
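As a hedged illustration of the Constant Potts Model mentioned above (the resolution_parameter of 0.05 is an arbitrary assumption; lower values yield larger communities):
#Hedged sketch: CPM partitioning with an explicit resolution parameter
cpm_comm = leidenalg.find_partition(knn, leidenalg.CPMVertexPartition, resolution_parameter=0.05)
print(len(cpm_comm), 'CPM communities vs', len(knn_comm), 'modularity communities')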
Comparing the different communities by size:
#Compiles gene lists per community
def get_community_table():
comm_counts=pd.DataFrame()
gene_lists=pd.DataFrame()
for i in [0,1,2,3]:
graph=[pos_w,pos_w,all_u,knn][i]
comm=[pos_comm,pos_w_comm,all_comm,knn_comm][i]
name=['pos','pos_w','all','knn'][i]
temp=pd.DataFrame(list(zip(graph.vs['name'],[x+1 for x in comm.membership]))).rename(columns={0:'gene',1:'community'})
counts=pd.DataFrame(temp.groupby('community')['gene'].agg(len))
counts.columns=[name]
        comm_counts=pd.concat([comm_counts, counts], axis=1)
gl=pd.DataFrame(temp.groupby('community')['gene'].apply(list)).reset_index()
gl['community']=['c'+str(i) for i in gl['community']]
gl['network']=name
gl=gl.loc[:,['network','community','gene']]
gene_lists=pd.concat([gene_lists, gl])
comm_counts.index=['c'+str(i) for i in comm_counts.index]
return([comm_counts,gene_lists])
#Plotting community sizes
import seaborn as sns
fig, ax = plt.subplots(figsize=(7, 4))
get_community_table()[0].fillna(0).T.plot(kind='bar', stacked=True, ax=ax);
ax.legend(get_community_table()[0].index, loc='right', bbox_to_anchor=(1.15, 1));
ax.set_title('Community size')
plt.xticks(rotation=0)
plt.show()
Functional analysis
In order to perform functional enrichment, we will extract each of the communities and perform a hypergeometric test to understand whether they are particularly enriched in specific biological functions.
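For reference, this corresponds to the hypergeometric upper tail (the notation here is ours): for a community of size $n$ drawn from a background of $N$ genes, of which $K$ are annotated to a term, the probability of an overlap of at least $k$ genes is $P(X \ge k) = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$.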
We will use enrichr to perform the gene set enrichment analysis. As background we will use the full list of genes that were quantified.
We will look at 3 gene set libraries. Should you have other kinds of data, enrichr allows you to define your own feature sets and perform a similar analysis. The challenge is in identifying comprehensive and well curated gene sets.
#All the available human libraries
gp.get_library_name(database='Human')[:15]
#we will search 3 libraries for significantly enriched gene sets
gene_sets=['GO_Biological_Process_2018','KEGG_2019_Human','OMIM_Disease']
background=all_u.copy().vs['name']
def perform_enrich(network):
temp=get_community_table()[1].copy()
temp=temp.loc[temp['network']==network]
output_enrichr=pd.DataFrame()
for comm in temp['community'].values:
gl=list(temp.loc[temp['community']==comm, 'gene'])[0]
for bp in gene_sets:
print('Analyzing '+network+' | Comm: '+comm+'/'+str(len(temp.index))+' | BP: '+bp)
enr=gp.enrichr(
gene_list=gl,
gene_sets=bp,
background=background,
outdir='data/Enrichr',
format='png'
)
results=enr.results.sort_values('Adjusted P-value', ascending=True)
results=results.loc[results['Adjusted P-value']<0.05,]
results['BP']=bp
results['Comm']=comm
results['Graph']=network
output_enrichr=pd.concat([output_enrichr, results])
return(output_enrichr)
all_enriched=pd.DataFrame()
for net in ['pos', 'pos_w', 'all', 'knn']:
all_enriched=pd.concat([all_enriched,perform_enrich(net)])
Running the command above not only gives you the results after significance testing (Q < 0.05), but also outputs some preliminary barplots with the statistically significant results (found under data/Enrichr/). For instance:

enriched_terms=all_enriched.loc[:,['Graph','Comm','Term','Adjusted P-value']].copy()
enriched_terms['Adjusted P-value']=-1*np.log10(enriched_terms['Adjusted P-value'])
enriched_terms.head()
fig, ax = plt.subplots(figsize=(7, 4))
data_bars=pd.DataFrame(enriched_terms.groupby(['Graph','Comm'])['Term'].agg('count')).stack().reset_index().rename(columns={0:'Count'})
sns.barplot(x='Graph', y='Count', data=data_bars, hue='Comm')
ax.set_title('Number of significant Terms (Q < 0.05) per community')
ax.legend(loc='right', bbox_to_anchor=(1.15, 1));
plt.xticks(rotation=0)
plt.show()
Note that some of these communities are very large, which explains the large number of biological processes found above.
#Number of genes/community
get_community_table()[0].fillna(0).T
Some communities are not enriched in any biological process:
pd.DataFrame(enriched_terms.groupby(['Graph','Comm'])['Term'].agg('count'))
We can now check whether the Full network, the Positively co-expressed network, and the Positively co-expressed weighted network show any common terms among their biggest communities. We do not compare with the kNN-G, as its communities are very homogeneous in size and differ from those of the other networks.
#Finding consensus
temp=enriched_terms.copy()
temp['comm_term']=temp.Comm+'_'+temp.Term
temp=temp.loc[:,['Graph','comm_term']]
consensus=pd.DataFrame()
consensus=pd.concat([consensus, temp.loc[temp['Graph']=='pos']])
consensus=pd.merge(consensus,
temp.loc[temp['Graph']=='pos_w'], on="comm_term", how='outer', suffixes=['pos','pos_w'])
consensus=pd.merge(consensus,
temp.loc[temp['Graph']=='all'], on="comm_term", how='outer').rename(columns={'Graph':'all'})
consensus=consensus.loc[consensus.isna().sum(1)==0].loc[:,['comm_term','Graphpos','Graphpos_w','all']]
Among the biggest communities we find several biological processes (55) that are simultaneously identified in the same community in the three graphs (Full, Pos coexp, and Pos coexp weighted).
consensus['comm']=[x[0] for x in consensus.comm_term.str.split('_')]
consensus.groupby('comm')['comm_term'].agg('count')
Questions
c3 in the weighted co-expression graph, present a decent size?

We will now extract the communities of genes in order to use them in further analyses in other sections.
#Requires running the community detection from the previous section
patlas=pd.read_csv('data/proteinatlas.tsv', sep="\t").loc[:,['Ensembl','Gene']]
#Compiles gene lists per community. We need Ensembl ids for further analyses
def get_ensembl():
comm_counts=pd.DataFrame()
gene_lists=pd.DataFrame()
for i in [0,1,2,3]:
graph=[pos_w,pos_w,all_u,knn][i]
comm=[pos_comm,pos_w_comm,all_comm,knn_comm][i]
name=['pos','pos_w','all','knn'][i]
temp=pd.DataFrame(list(zip(graph.vs['name'],[x+1 for x in comm.membership]))).rename(columns={0:'gene',1:'community'})
gl=pd.DataFrame(temp.groupby('community')['gene'].apply(list)).reset_index()
gl['community']=['c'+str(i) for i in gl['community']]
gl['network']=name
gl=gl.loc[:,['network','community','gene']]
gene_lists=pd.concat([gene_lists, gl])
gene_communities=gene_lists
gene_mat=pd.DataFrame()
for net in gene_communities['network'].unique():
temp=gene_communities.copy().loc[gene_communities['network']==net,]
for comm in temp['community'].unique():
gl=list(temp.copy().loc[temp['community']==comm,'gene'])[0]
el=[patlas.loc[patlas['Gene']==x,'Ensembl'].iloc[0] for x in gl if x in patlas['Gene'].values]
df=pd.DataFrame([net,comm,gl,el]).T
df.columns=['network','community','Gene','Ensembl']
gene_mat=pd.concat([gene_mat, df])
return(gene_mat)
get_ensembl().to_csv('data/gene_communities.tsv', sep="\t", index=False)
Here we've performed some network analyses based on a co-expression network. We've explored different centrality measures to characterize the networks, and identified the communities of genes in these networks. We have also used gene set enrichment analysis to characterize these communities.